pacman::p_load(tidyverse, sf, spdep, GWmodel, SpatialML, spatstat, units, matrixStats, ggpubr,
tmap, rsample, Metrics, httr, jsonlite, rvest, olsrr, corrplot, ggstatsplot)Take-home Exercise 3b: Geospatial Analytics
1 Overview
For take-home exercise 3, it consists of 2 options: Take-home Exercise 3a: Modelling Geography of Financial Inclusion with Geographically Weighted Methods & Take-home Exercise 3b: Predicting HDB Resale Prices with Geographically Weighted Machine Learning Methods. The selected option for this take-home exercise will be 3b.
Housing plays a crucial role in household wealth across the globe, with purchasing a home representing a significant investment for most individuals. Housing prices are influenced by a variety of factors. Some of these factors are global, such as the overall economic conditions of a country or the inflation rate, while others are specific to individual properties. These factors can be categorised into structural and locational components.
Structural factors relate directly to the characteristics of the property, such as its size, amenities, and tenure.
Locational factors pertain to the surrounding environment, including proximity to childcare centres, public transportation, and shopping facilities.
Traditionally, predictive models for housing resale prices have been developed using the Ordinary Least Squares (OLS) method. However, this approach does not account for spatial autocorrelation and spatial heterogeneity present in geographic datasets, such as those related to housing transactions. When spatial autocorrelation is present, OLS estimates can produce biased, inconsistent, or inefficient results (Anselin 1998). To address this limitation, Geographically Weighted Models (GWMs) have been introduced, offering a more accurate approach to modelling and predicting housing resale prices. The steps involved will ultimately build a hedonic pricing models using this methods.
1.1 Task and Outcomes
For this take-home exercise, the primary dataset should be the HDB Resale Flat Prices available on data.gov.sg The analysis should concentrate on one specific flat type: three-room, four-room, or five-room flats.
The following is a list of suggested predictors to consider, though students are encouraged to include any other relevant independent variables that may enhance the analysis.
- Structural factors
- Area of the unit
- Floor level
- Remaining lease
- Age of the unit
- Main Upgrading Program (MUP) completed (optional)
- Locational factors
- Proximity to CBD
- Proximity to eldercare
- Proximity to foodcourt/hawker centres
- Proximity to MRT
- Proximity to park
- Proximity to good primary school
- Proximity to shopping mall
- Proximity to supermarket
- Numbers of kindergartens within 350m
- Numbers of childcare centres within 350m
- Numbers of bus stop within 350m
- Numbers of primary school within 1km
The four-room flats will be the chosen flat type for analysis as it is one of the most common HDB BTO flat types, which offers a comfortable living space for young couples and families.
Additionally, in this take-home exercise, we are tasked with calibrating a predictive model to forecast HDB resale prices for the period of July to September 2024, using resale transaction data from 2023 as the basis for analysis.
2 Installing and Loading R Packages
The code chunk below will ensure for a list of required R packages to be created, checked for installation, and installed if missing. Once installed, all packages will be loaded for use in the exercise.
3 HDB resale Data (Aspatial)
The following sections will consist of steps which import, process and wrangling of data.
Data used for this exercise is HDB Resale data: a list of HDB resale transacted prices in Singapore from Jan 2017 onwards. It is in csv format which can be downloaded from data.gov.sg.

When first downloaded, the data was labelled as ResaleflatpricesbasedonregistrationdatefromJan2017onwards. Hence, it was subsequently renamed to resale for ease of referencing and to avoid unnecessary mistakes. Similar to what is required in the task of using HDB resale transaction records in 2023 to predict HDB resale prices between July-September 2024 the code chunk below filters for transaction records for the entirety of 2023 and July till September 2024. Based on the requirements of this exercise, I have decided to focus my study on four-room flats.

resale <- read_csv("data/HDB/rawdata/resale.csv") %>%
filter(month >= "2023-01" & month <= "2023-12" | month %in% c("2024-07", "2024-08", "2024-09")) %>%
filter(flat_type == "4 ROOM")
The observations have been reduced to 14733 after reducing the area of study to soley four-room flats.
The code chunk below serves the functions of combining block and street_name variables to create a new and more complete variable address (excluding unit number) alongside remaining_lease_yr and remaining_lease_mth extracted from remaining_lease. This function will supplement our steps later on in creating the model.
resale_tidy <- resale %>%
mutate(address = paste(block,street_name)) %>%
mutate(remaining_lease_yr = as.integer(
str_sub(remaining_lease, 0, 2)))%>%
mutate(remaining_lease_mth = as.integer(
str_sub(remaining_lease, 9, 11)))Due to the addresses having same street names, a list is created with unique addresses to reduce the number of records and for the list to be passed through for API ingestion in the following steps. The sort function is used to reorder the records so that records will be in order. In essence, the code chunk below sorts a list of unique addresses to avoid the issue of repeated geocoding.
add_list <- sort(unique(resale_tidy$address))The following code chunks are used to obtain the postal code of the addresses using geocoding by passing the entries through the onemap API. This code chunk is credited to Prof. Kam where a HTTP address is sent to the OneMap API.
get_coords <- function(add_list){
# Create a data frame to store all retrieved coordinates
postal_coords <- data.frame()
for (i in add_list){
#print(i)
r <- GET('https://www.onemap.gov.sg/api/common/elastic/search?',
query=list(searchVal=i,
returnGeom='Y',
getAddrDetails='Y'))
data <- fromJSON(rawToChar(r$content))
found <- data$found
res <- data$results
# Create a new data frame for each address
new_row <- data.frame()
# If single result, append
if (found == 1){
postal <- res$POSTAL
lat <- res$LATITUDE
lng <- res$LONGITUDE
new_row <- data.frame(address= i,
postal = postal,
latitude = lat,
longitude = lng)
}
# If multiple results, drop NIL and append top 1
else if (found > 1){
# Remove those with NIL as postal
res_sub <- res[res$POSTAL != "NIL", ]
# Set as NA first if no Postal
if (nrow(res_sub) == 0) {
new_row <- data.frame(address= i,
postal = NA,
latitude = NA,
longitude = NA)
}
else{
top1 <- head(res_sub, n = 1)
postal <- top1$POSTAL
lat <- top1$LATITUDE
lng <- top1$LONGITUDE
new_row <- data.frame(address= i,
postal = postal,
latitude = lat,
longitude = lng)
}
}
else {
new_row <- data.frame(address= i,
postal = NA,
latitude = NA,
longitude = NA)
}
# Add the row
postal_coords <- rbind(postal_coords, new_row)
}
return(postal_coords)
}Following which, the get_coords() function is used to crawl the coordinates from the generated list
coords <- get_coords(add_list)The following code chunk will be used to save the results to avoid having to re-run the code chunks above which will take up additional time and resources.
write_rds(coords, "data/HDB/rds/coords.rds")coords <- read_rds('data/HDB/rds/coords.rds')3.1 Setting CRS for HDB Resale Data
The code chunk below first creates an sf object before the EPSG code is set for Singapore which is 3414
resale_tidy <- resale_tidy %>%
left_join(coords, by = c("address" = "address")) %>%
st_as_sf(coords = c("longitude", "latitude"), crs = 4326) %>%
st_transform(crs = 3414)
write_rds(resale_tidy, "data/HDB/rds/resale.rds")resale_tidy <- read_rds("data/HDB/rds/resale.rds")3.2 Structural factors for HDB Resale units
- Structural factors
- Area of the unit
- Floor level
- Remaining lease
- Age of the unit
- Main Upgrading Program (MUP) completed (optional)
3.2.1 Floor Level
As one of the structural factors the code chunk below is used to view the floor levels in resale_tidy using the unique() function
unique(resale_tidy$storey_range) [1] "07 TO 09" "10 TO 12" "01 TO 03" "04 TO 06" "16 TO 18" "25 TO 27"
[7] "13 TO 15" "22 TO 24" "19 TO 21" "28 TO 30" "34 TO 36" "43 TO 45"
[13] "31 TO 33" "46 TO 48" "40 TO 42" "37 TO 39" "49 TO 51"
As the variable for storey_range is in string, transformation is done to mutate it as numeric. However, the storey_range as stated follows a range which will make it hard for analysis as the exact floor level is not stated. Hence, this numeric attribute will be transformed based on the median value as the chosen method ensuring that we have values to work with instead of a range. The code transforms the storey_range variable from a string representing a range into a numeric variable by calculating the median floor level for each range, enabling easier analysis.
resale_tidy <- resale_tidy %>%
mutate(
level = (as.numeric(str_extract(storey_range, "^[0-9]+")) +
as.numeric(str_extract(storey_range, "[0-9]+$"))) / 2
)Code chunk below is used to get a summary if the transformation is done correctly
summary(resale_tidy$level) Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 5.000 8.000 9.246 11.000 50.000
3.2.2 Remaining Lease & Age of Unit
We will also do some data wrangling for remaining_lease_yr & remaining_lease_mth. The code processes the remaining_lease_yr and remaining_lease_mth variables to first handle missing values, calculate the remaining lease in decimal years, and derive the unit’s age based on a 99-year HDB lease. The original year and month variables are then removed for a cleaner dataset.
resale_tidy <- resale_tidy %>%
mutate(
# Replace NA in months with 0 as observed in resale_tidy
remaining_lease_mth = if_else(is.na(remaining_lease_mth), 0, remaining_lease_mth),
# Calculate remaining lease in decimal years
remaining_lease = remaining_lease_yr + (remaining_lease_mth / 12),
# Age of unit calculation based on a HDB 99-year lease
unit_age = 99 - remaining_lease
) %>%
relocate(remaining_lease_yr, remaining_lease_mth, .after = last_col())4 Locational factors (Geospatial)
The following locational factors will be derived from their respective data sources such as from data.gov.sg for this exercise.
- Locational factors
- Proximity to CBD
- Proximity to elder care
- Proximity to hawker centres
- Proximity to MRT
- Proximity to park
- Proximity to CHAS Clinics
- Proximity to good primary school
- Proximity to supermarket
- Numbers of kindergartens within 350m
- Numbers of childcare centres within 350m
- Numbers of bus stop within 350m
Based on the locational factors above, we will import these geospatial data into the R environment. Firstly the Master Plan Subzone boundary data
mpsz <- st_read(dsn = "data/geospatial", layer = "MP14_SUBZONE_WEB_PL") %>%
st_transform(3414)Reading layer `MP14_SUBZONE_WEB_PL' from data source
`C:\zjho008\ISSS626-GAA\Take-home_Ex\Take-home_Ex03\data\geospatial'
using driver `ESRI Shapefile'
Simple feature collection with 323 features and 15 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 2667.538 ymin: 15748.72 xmax: 56396.44 ymax: 50256.33
Projected CRS: SVY21
mpszSimple feature collection with 323 features and 15 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 2667.538 ymin: 15748.72 xmax: 56396.44 ymax: 50256.33
Projected CRS: SVY21 / Singapore TM
First 10 features:
OBJECTID SUBZONE_NO SUBZONE_N SUBZONE_C CA_IND PLN_AREA_N
1 1 1 MARINA SOUTH MSSZ01 Y MARINA SOUTH
2 2 1 PEARL'S HILL OTSZ01 Y OUTRAM
3 3 3 BOAT QUAY SRSZ03 Y SINGAPORE RIVER
4 4 8 HENDERSON HILL BMSZ08 N BUKIT MERAH
5 5 3 REDHILL BMSZ03 N BUKIT MERAH
6 6 7 ALEXANDRA HILL BMSZ07 N BUKIT MERAH
7 7 9 BUKIT HO SWEE BMSZ09 N BUKIT MERAH
8 8 2 CLARKE QUAY SRSZ02 Y SINGAPORE RIVER
9 9 13 PASIR PANJANG 1 QTSZ13 N QUEENSTOWN
10 10 7 QUEENSWAY QTSZ07 N QUEENSTOWN
PLN_AREA_C REGION_N REGION_C INC_CRC FMEL_UPD_D X_ADDR
1 MS CENTRAL REGION CR 5ED7EB253F99252E 2014-12-05 31595.84
2 OT CENTRAL REGION CR 8C7149B9EB32EEFC 2014-12-05 28679.06
3 SR CENTRAL REGION CR C35FEFF02B13E0E5 2014-12-05 29654.96
4 BM CENTRAL REGION CR 3775D82C5DDBEFBD 2014-12-05 26782.83
5 BM CENTRAL REGION CR 85D9ABEF0A40678F 2014-12-05 26201.96
6 BM CENTRAL REGION CR 9D286521EF5E3B59 2014-12-05 25358.82
7 BM CENTRAL REGION CR 7839A8577144EFE2 2014-12-05 27680.06
8 SR CENTRAL REGION CR 48661DC0FBA09F7A 2014-12-05 29253.21
9 QT CENTRAL REGION CR 1F721290C421BFAB 2014-12-05 22077.34
10 QT CENTRAL REGION CR 3580D2AFFBEE914C 2014-12-05 24168.31
Y_ADDR SHAPE_Leng SHAPE_Area geometry
1 29220.19 5267.381 1630379.3 MULTIPOLYGON (((31495.56 30...
2 29782.05 3506.107 559816.2 MULTIPOLYGON (((29092.28 30...
3 29974.66 1740.926 160807.5 MULTIPOLYGON (((29932.33 29...
4 29933.77 3313.625 595428.9 MULTIPOLYGON (((27131.28 30...
5 30005.70 2825.594 387429.4 MULTIPOLYGON (((26451.03 30...
6 29991.38 4428.913 1030378.8 MULTIPOLYGON (((25899.7 297...
7 30230.86 3275.312 551732.0 MULTIPOLYGON (((27746.95 30...
8 30222.86 2208.619 290184.7 MULTIPOLYGON (((29351.26 29...
9 29893.78 6571.323 1084792.3 MULTIPOLYGON (((20996.49 30...
10 30104.18 3454.239 631644.3 MULTIPOLYGON (((24472.11 29...
The extent of mpsz is shown by using st_bbox() of sf package.
st_bbox(mpsz) #view extent xmin ymin xmax ymax
2667.538 15748.721 56396.440 50256.334
4.1 Eldercare
eldercare <- st_read(dsn = "data/geospatial", layer = "ELDERCARE") %>%
st_transform(3414)Reading layer `ELDERCARE' from data source
`C:\zjho008\ISSS626-GAA\Take-home_Ex\Take-home_Ex03\data\geospatial'
using driver `ESRI Shapefile'
Simple feature collection with 133 features and 18 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 14481.92 ymin: 28218.43 xmax: 41665.14 ymax: 46804.9
Projected CRS: SVY21
4.2 CHAS Clinics
chas <- st_read("data/geospatial/CHASClinics.kml") %>%
st_transform(crs = 3414)Reading layer `MOH_CHAS_CLINICS' from data source
`C:\zjho008\ISSS626-GAA\Take-home_Ex\Take-home_Ex03\data\geospatial\CHASClinics.kml'
using driver `KML'
Simple feature collection with 1193 features and 2 fields
Geometry type: POINT
Dimension: XYZ
Bounding box: xmin: 103.5818 ymin: 1.016264 xmax: 103.9903 ymax: 1.456037
z_range: zmin: 0 zmax: 0
Geodetic CRS: WGS 84
4.3 Childcare Centres
childcare <- st_read("data/geospatial/ChildCareServices.kml") %>%
st_transform(crs = 3414)Reading layer `CHILDCARE' from data source
`C:\zjho008\ISSS626-GAA\Take-home_Ex\Take-home_Ex03\data\geospatial\ChildCareServices.kml'
using driver `KML'
Simple feature collection with 1925 features and 2 fields
Geometry type: POINT
Dimension: XYZ
Bounding box: xmin: 103.6878 ymin: 1.247759 xmax: 103.9897 ymax: 1.462134
z_range: zmin: 0 zmax: 0
Geodetic CRS: WGS 84
4.4 Kindergartens
kindergartens <- st_read("data/geospatial/Kindergartens.geojson") %>%
st_transform(crs = 3414)Reading layer `Kindergartens' from data source
`C:\zjho008\ISSS626-GAA\Take-home_Ex\Take-home_Ex03\data\geospatial\Kindergartens.geojson'
using driver `GeoJSON'
Simple feature collection with 448 features and 2 fields
Geometry type: POINT
Dimension: XYZ
Bounding box: xmin: 103.6887 ymin: 1.247759 xmax: 103.9717 ymax: 1.455452
z_range: zmin: 0 zmax: 0
Geodetic CRS: WGS 84
4.5 Parks
parks <- st_read("data/geospatial/Parks.geojson") %>%
st_transform(crs = 3414)Reading layer `Parks' from data source
`C:\zjho008\ISSS626-GAA\Take-home_Ex\Take-home_Ex03\data\geospatial\Parks.geojson'
using driver `GeoJSON'
Simple feature collection with 430 features and 2 fields
Geometry type: POINT
Dimension: XYZ
Bounding box: xmin: 103.6929 ymin: 1.214491 xmax: 104.0538 ymax: 1.462094
z_range: zmin: 0 zmax: 0
Geodetic CRS: WGS 84
4.6 Hawker Centres
hawker_centre <- st_read("data/geospatial/HawkerCentresGEOJSON.geojson") %>%
st_transform(crs = 3414)Reading layer `HawkerCentresGEOJSON' from data source
`C:\zjho008\ISSS626-GAA\Take-home_Ex\Take-home_Ex03\data\geospatial\HawkerCentresGEOJSON.geojson'
using driver `GeoJSON'
Simple feature collection with 125 features and 2 fields
Geometry type: POINT
Dimension: XYZ
Bounding box: xmin: 103.6974 ymin: 1.272716 xmax: 103.9882 ymax: 1.449017
z_range: zmin: 0 zmax: 0
Geodetic CRS: WGS 84
4.7 Supermarkets
supermarkets <- st_read("data/geospatial/SupermarketsGEOJSON.geojson") %>%
st_transform(crs = 3414)Reading layer `SupermarketsGEOJSON' from data source
`C:\zjho008\ISSS626-GAA\Take-home_Ex\Take-home_Ex03\data\geospatial\SupermarketsGEOJSON.geojson'
using driver `GeoJSON'
Simple feature collection with 526 features and 2 fields
Geometry type: POINT
Dimension: XYZ
Bounding box: xmin: 103.6258 ymin: 1.24715 xmax: 104.0036 ymax: 1.461526
z_range: zmin: 0 zmax: 0
Geodetic CRS: WGS 84
4.8 Bus Stop Locations
bus_stops <- st_read(dsn = "data/geospatial", layer = "BusStop") %>%
st_transform(crs = 3414) %>%
filter(lengths(st_within(., mpsz)) > 0)Reading layer `BusStop' from data source
`C:\zjho008\ISSS626-GAA\Take-home_Ex\Take-home_Ex03\data\geospatial'
using driver `ESRI Shapefile'
Simple feature collection with 5166 features and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 3970.122 ymin: 26482.1 xmax: 48285.52 ymax: 52983.82
Projected CRS: SVY21
4.9 MRT Accessibility
MRT <- st_read(dsn = "data/geospatial", layer = "RapidTransitSystemStation") %>%
st_transform(crs = 3414)Reading layer `RapidTransitSystemStation' from data source
`C:\zjho008\ISSS626-GAA\Take-home_Ex\Take-home_Ex03\data\geospatial'
using driver `ESRI Shapefile'
Simple feature collection with 230 features and 5 fields (with 1 geometry empty)
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: 6068.209 ymin: 27478.44 xmax: 45377.5 ymax: 47913.58
Projected CRS: SVY21
Sys.setenv(OGR_GEOMETRY_ACCEPT_UNCLOSED_RING = "NO")
MRT <- MRT[!st_is_empty(MRT), ]
# Convert Polygon to Point
MRT <- st_centroid(MRT)5 Processing Geospatial Data
In the previous section, we have loaded the geospatial data of interest it was also observed that some of this data consisted of the Z dimension. We will proceed to remove them as well as drop and unnecessary columns to reduce computation time and ensure geometries are valid.
chas <- st_zm(chas)
childcare <- st_zm(childcare)
kindergartens <- st_zm(kindergartens)
parks <- st_zm(parks)
hawker_centre <- st_zm(hawker_centre)
supermarkets <- st_zm(supermarkets)The code chunks below are used to remove columns not exactly needed to do analysis as the needed variables are generally the Name for identification and Geometry variables. Note that not all the columns selected are Column 1.
eldercare <- eldercare %>%
select(c(1))
chas <- chas %>%
select(c(1))
childcare <- childcare %>%
select(c(1))
kindergartens <- kindergartens %>%
select(c(1))
parks <- parks %>%
select(c(1))
hawker_centre <- hawker_centre %>%
select(c(1))
supermarkets <- supermarkets %>%
select(c(1))
bus_stops <- bus_stops %>%
select(c(1))
MRT <- MRT %>%
select(c(5))The code chunk below is used to ensure geometries are valid.
length(which(st_is_valid(mpsz) == FALSE))[1] 9
length(which(st_is_valid(eldercare) == FALSE))[1] 0
length(which(st_is_valid(chas) == FALSE))[1] 0
length(which(st_is_valid(childcare) == FALSE))[1] 0
length(which(st_is_valid(kindergartens) == FALSE))[1] 0
length(which(st_is_valid(parks) == FALSE))[1] 0
length(which(st_is_valid(hawker_centre) == FALSE))[1] 0
length(which(st_is_valid(supermarkets) == FALSE))[1] 0
length(which(st_is_valid(bus_stops) == FALSE))[1] 0
length(which(st_is_valid(MRT) == FALSE))[1] 0
mpsz has invalid 9 invalid geometries which we will fix using st_make_valid()
mpsz <- st_make_valid(mpsz)
length(which(st_is_valid(mpsz) == FALSE))[1] 0
The code chunk is ran again to do checks ensuring valid geometries
length(which(st_is_valid(mpsz) == FALSE))[1] 0
length(which(st_is_valid(eldercare) == FALSE))[1] 0
length(which(st_is_valid(chas) == FALSE))[1] 0
length(which(st_is_valid(childcare) == FALSE))[1] 0
length(which(st_is_valid(kindergartens) == FALSE))[1] 0
length(which(st_is_valid(parks) == FALSE))[1] 0
length(which(st_is_valid(hawker_centre) == FALSE))[1] 0
length(which(st_is_valid(supermarkets) == FALSE))[1] 0
length(which(st_is_valid(bus_stops) == FALSE))[1] 0
length(which(st_is_valid(MRT) == FALSE))[1] 0
We can see that they are all fixed.
6 Visualising the Data
In this section we will do quick visualisations without much customisations done just to ensure that the data is appropriate before proceeding.
tm_shape(mpsz) +
tm_polygons(col = "grey")
tm_shape(mpsz) +
tm_polygons(col = "grey") +
tm_shape(eldercare) +
tm_dots(col = "red", size = 0.1)
tm_shape(mpsz) +
tm_polygons(col = "grey") +
tm_shape(chas) +
tm_dots(col = "red")
tm_shape(mpsz) +
tm_polygons(col = "grey") +
tm_shape(childcare) +
tm_dots(col = "red")
tm_shape(mpsz) +
tm_polygons(col = "grey") +
tm_shape(kindergartens) +
tm_dots(col = "red", size = 0.08)
tm_shape(mpsz) +
tm_polygons(col = "grey") +
tm_shape(parks) +
tm_dots(col = "red", size = 0.08)
tm_shape(mpsz) +
tm_polygons(col = "grey") +
tm_shape(hawker_centre) +
tm_dots(col = "red", size = 0.1)
tm_shape(mpsz) +
tm_polygons(col = "grey") +
tm_shape(supermarkets) +
tm_dots(col = "red", size = 0.08)
tm_shape(mpsz) +
tm_polygons(col = "grey") +
tm_shape(bus_stops) +
tm_dots(col = "red", size = 0.005)
tm_shape(mpsz) +
tm_polygons(col = "grey") +
tm_shape(MRT) +
tm_dots(col = "red", size = 0.1)
Based on the above visualisations the data points look in place without any abnormalities.
7 Locational Factors (Proximity Calculation)
Now to calculate the proximity of HDB flats to relevant facilities. The provided proximity function - credits streamlines this process by calculating the minimum distance from each feature in df1 to the nearest feature in df2 and assigning this distance to a new column specified by varname.
The following locational factors will be calculated in terms of proximity:
- Proximity to CBD (Raffles Place, Tanjong Pagar MRT & City Hall MRT)
- Proximity to eldercare
- Proximity to CHAS Clinics
- Proximity to supermarket
- Proximity to hawker centres
- Proximity to supermarket
- Proximity to MRT
- Numbers of kindergartens within 350m
- Numbers of childcare centres within 350m
- Numbers of bus stop within 350m
proximity <- function(df1, df2, varname) {
dist_matrix <- st_distance(df1, df2) %>%
drop_units()
# Calculate minimum distance for each row
df1[,varname] <- rowMins(dist_matrix)
return(df1)
}cbd <- filter(MRT, STN_NAM_DE %in% c("RAFFLES PLACE MRT STATION", "TANJONG PAGAR MRT STATION", "CITY HALL MRT STATION"))
resale_tidy <- proximity(resale_tidy, cbd, "PROX_CBD") %>%
proximity(., eldercare, "PROX_ELDERCARE") %>%
proximity(., chas, "PROX_CHAS") %>%
proximity(., hawker_centre, "PROX_HAWKER") %>%
proximity(., MRT, "PROX_MRT") %>%
proximity(., parks, "PROX_PARK") %>%
proximity(., childcare, "PROX_CHILDCARE") %>%
proximity(., kindergartens, "PROX_KINDERGARTEN") %>%
proximity(., supermarkets, "PROX_SUPERMARKET") %>%
proximity(., bus_stops, "PROX_BUS_STOP")We also need to calculate the number of facilities within a specific radius from the resale flats. The count_in_radius function accomplishes this by calculating the distance matrix between df1 and df2 using st_distance, identifying features within the specified radius, and summing these counts in a new column in df1 designated by varname.
count_in_radius <- function(df1, df2, varname, radius) {
dist_matrix <- st_distance(df1, df2) %>%
drop_units() %>%
as.data.frame()
df1[,varname] <- rowSums(dist_matrix <= radius)
return(df1)
}The code chunk below is used to calculate the locational factors within a 350m radius
resale_tidy <- count_in_radius(resale_tidy, kindergartens, "NUM_KINDERGARTEN", 350) %>%
count_in_radius(., childcare, "NUM_CHILDCARE", 350) %>%
count_in_radius(., bus_stops, "NUM_BUS_STOP", 350) %>%
count_in_radius(., chas, "NUM_CHAS", 350)Calibrating predictive models are computationally intensive, especially when the random forest method is used. For quick prototyping, a smaller sample will be selected at random from the data by using the code chunk below.
resale_tidy consists of 14733 observations and 23 variables.
A sample of 5000 will be used.
set.seed(1234)
resale_tidy <- resale_tidy %>%
sample_n(5000)7.1 Checking of overlapping points
The code chunk below is used to check if there are overlapping point features.
overlapping_points <- resale_tidy %>%
mutate(overlap = lengths(st_equals(., .)) > 1)
Based on the results, it is observed that there are already multiple overlaps hence, in the code code chunk below, st_jitter() of sf package is used to move the point features by 3 metres to avoid overlapping point features.
resale_tidy <- resale_tidy %>%
st_jitter(amount = 3)The code chunk is run again to check for overlapping points.
overlapping_points <- resale_tidy %>%
mutate(overlap = lengths(st_equals(., .)) > 1)The results now show that there are no overlapping point features with no results shown when TRUE is selected for the overlap variable.
When using GWmodel to calibrate explanatory or predictive models, it is very important to ensure that there are no overlapping point features
write_rds(resale_tidy, "data/HDB/rds/int_resale.rds")resale_tidy <- read_rds("data/HDB/rds/int_resale.rds")Columns which are not needed for this analysis in this exercise will be dropped and the file is then saved as a RDS file for ease of retrieval without the need to run the above code chunks again.
resale_tidy <- resale_tidy %>%
rename(
MONTH = month,
TOWN = town,
FLOOR_AREA_SQM = floor_area_sqm,
ADDRESS = address,
RESALE_PRICE = resale_price,
LEVEL = level,
REMAINING_LEASE = remaining_lease,
UNIT_AGE = unit_age
) %>%
select(MONTH, TOWN, ADDRESS, REMAINING_LEASE, FLOOR_AREA_SQM, UNIT_AGE, RESALE_PRICE, LEVEL,
PROX_CBD, PROX_ELDERCARE, PROX_CHAS, PROX_HAWKER, PROX_MRT, PROX_PARK, PROX_CHILDCARE,
PROX_KINDERGARTEN, PROX_SUPERMARKET, PROX_BUS_STOP, NUM_KINDERGARTEN, NUM_CHILDCARE, NUM_BUS_STOP, NUM_CHAS)
write_rds(resale_tidy, "data/HDB/rds/final_resale.rds")Code chunk below is used for ease of retrieving the finalised resale_tidy data table
resale_tidy <- read_rds("data/HDB/rds/final_resale.rds")8 Exploratory Data Analysis (EDA)
In the section, it will make use of statistical graphics functions of ggplot2 package to perform EDA on resale_tidy
8.1 EDA using statistical graphics
8.1.1 Resale Price
ggplot(data = resale_tidy, aes(x = `RESALE_PRICE`)) +
geom_histogram(bins = 20, color = "black", fill = "salmon")
The figure above reveals a right skewed distribution. This means that more 4-room resale units were transacted at relative lower prices ranging from $400,000 to $600,000 price range.
8.1.2 Floor Area (SQM)
ggplot(data = resale_tidy, aes(x = `FLOOR_AREA_SQM`)) +
geom_histogram(bins = 20, color = "black", fill = "salmon")
In terms of floor area, the figure above reveals a right skewed distribution. This means that 4-room resale units were within the 90 to 100 sqm floor area.
8.1.3 Multiple Histogram Plots for Locational Factors
The code chunk below is used to create 12 histograms. Then, ggarrange() is used to organised the histograms into a 3 column by 4 row small multiple plot.
AREA_SQM <- ggplot(data = resale_tidy, aes(x= `FLOOR_AREA_SQM`)) +
geom_histogram(bins=20, color="black", fill="light blue")
AGE <- ggplot(data = resale_tidy, aes(x= `UNIT_AGE`)) +
geom_histogram(bins=20, color="black", fill="light blue")
REMAINING_LEASE <- ggplot(data = resale_tidy, aes(x= `REMAINING_LEASE`)) +
geom_histogram(bins=20, color="black", fill="light blue")
PROX_CBD <- ggplot(data = resale_tidy, aes(x= `PROX_CBD`)) +
geom_histogram(bins=20, color="black", fill="light blue")
PROX_CHILDCARE <- ggplot(data = resale_tidy, aes(x= `PROX_CHILDCARE`)) +
geom_histogram(bins=20, color="black", fill="light blue")
PROX_ELDERCARE <- ggplot(data = resale_tidy, aes(x= `PROX_ELDERCARE`)) +
geom_histogram(bins=20, color="black", fill="light blue")
PROX_HAWKER <- ggplot(data = resale_tidy, aes(x = `PROX_HAWKER`)) +
geom_histogram(bins=20, color="black", fill="light blue")
PROX_KINDERGARTEN <- ggplot(data = resale_tidy, aes(x = `PROX_KINDERGARTEN`)) +
geom_histogram(bins=20, color="black", fill="light blue")
PROX_MRT <- ggplot(data = resale_tidy, aes(x= `PROX_MRT`)) +
geom_histogram(bins=20, color="black", fill="light blue")
PROX_PARK <- ggplot(data = resale_tidy, aes(x= `PROX_PARK`)) +
geom_histogram(bins=20, color="black", fill="light blue")
PROX_BUS_STOP <- ggplot(data = resale_tidy, aes(x= `PROX_BUS_STOP`)) +
geom_histogram(bins=20, color="black", fill="light blue")
PROX_SUPERMARKET <- ggplot(data = resale_tidy,
aes(x= `PROX_SUPERMARKET`)) +
geom_histogram(bins=20, color="black", fill="light blue")ggarrange(AREA_SQM, AGE, REMAINING_LEASE, PROX_CBD, PROX_CHILDCARE,
PROX_ELDERCARE, PROX_HAWKER, PROX_KINDERGARTEN, PROX_MRT,
PROX_PARK, PROX_BUS_STOP, PROX_SUPERMARKET,
ncol = 3, nrow = 4)
8.2 Statistical Point Map
We can also reveal the geospatial pattern distribution of four-room HDB resale prices and unit age sold in Singapore. The map will be prepared by using tmap package.
First, we will turn on the interactive mode of tmap by using the code chunk below.
tmap_mode("view")tm_shape(mpsz)+
tm_polygons() +
tm_shape(resale_tidy) +
tm_dots(col = "RESALE_PRICE",
alpha = 0.6,
style="quantile") +
tm_view(set.zoom.limits = c(11,14))set.zoom.limits argument of tm_view() sets the minimum and maximum zoom level to 11 and 14 respectively.
tm_shape(mpsz)+
tm_polygons() +
tm_shape(resale_tidy) +
tm_dots(col = "UNIT_AGE",
alpha = 0.6,
style="quantile") +
tm_view(set.zoom.limits = c(11,14))Before moving on to the next section, the code below will be used to turn R display into plot mode. This also helps to reduce rendering time.
tmap_mode("plot")9 Computing Correlation Matrix
Before loading the predictors into a predictive model, it is always a good practice to use correlation matrix to examine if there is sign of multicollinearity.
resale_tidy_nogeo <- resale_tidy %>%
st_drop_geometry()
corrplot::corrplot(cor(resale_tidy_nogeo[, 4:18]),
diag = FALSE,
order = "AOE",
tl.pos = "td",
tl.cex = 0.5,
method = "number",
type = "upper")
The correlation matrix above shows that all the correlation values are below 0.8 except UNIT_AGE and REMAINING_LEASE which is at -1.0 given that they are highly related since UNIT_AGE was derived with the 99 year lease deducted by REMAINING_LEASE. Hence, one of this variables will be dropped to ensure there is no high multicollinearity in the model.
The code chunk below uses ggcorrmat() function to compute the correlation matrix.
Note that
UNIT_AGEis not selected as it will be excluded in the subsequent model building section.
resale_tidy_nogeo <- resale_tidy %>%
st_drop_geometry()
ggstatsplot::ggcorrmat(resale_tidy_nogeo[, c(4:5, 7:18)])
10 Model Building
10.1 Data Sampling
The entire data is split into training and test data sets with resale transaction records in 2023 as the training data and July-September (Q3) 2024 resale transaction records respectively.
If the data is in a one year period and training and test data has to be split the it can be done by using initial_split() of rsample package. rsample is one of the package of tidymodels.
set.seed(1234)
resale_train <- resale_tidy %>%
filter(str_sub(MONTH, 1, 4) == "2023")
resale_test <- resale_tidy %>%
filter(str_sub(MONTH, 1, 4) == "2024")10.2 Building a Multiple Linear Regression
The code chunk below uses lm() to calibrate the multiple linear regression model.
resale.mlr <- lm(formula = RESALE_PRICE ~ FLOOR_AREA_SQM + LEVEL + REMAINING_LEASE +
PROX_CBD + PROX_ELDERCARE + PROX_CHAS + PROX_HAWKER + PROX_MRT + PROX_PARK +
PROX_CHILDCARE + PROX_KINDERGARTEN + PROX_SUPERMARKET + PROX_BUS_STOP + NUM_KINDERGARTEN +
NUM_CHILDCARE + NUM_BUS_STOP + NUM_CHAS,
data = resale_train)
ols_regress(resale.mlr) Model Summary
--------------------------------------------------------------------------
R 0.884 RMSE 61822.812
R-Squared 0.781 MSE 3840136076.103
Adj. R-Squared 0.780 Coef. Var 10.638
Pred R-Squared 0.778 AIC 95262.990
MAE 47560.585 SBC 95381.722
--------------------------------------------------------------------------
RMSE: Root Mean Square Error
MSE: Mean Square Error
MAE: Mean Absolute Error
AIC: Akaike Information Criteria
SBC: Schwarz Bayesian Criteria
ANOVA
------------------------------------------------------------------------------
Sum of
Squares DF Mean Square F Sig.
------------------------------------------------------------------------------
Regression 5.205812e+13 17 3.062242e+12 797.431 0.0000
Residual 1.461556e+13 3806 3840136076.103
Total 6.667368e+13 3823
------------------------------------------------------------------------------
Parameter Estimates
----------------------------------------------------------------------------------------------------------
model Beta Std. Error Std. Beta t Sig lower upper
----------------------------------------------------------------------------------------------------------
(Intercept) -42061.163 18317.692 -2.296 0.022 -77974.600 -6147.725
FLOOR_AREA_SQM 4229.999 155.882 0.219 27.136 0.000 3924.379 4535.619
LEVEL 5817.233 172.053 0.285 33.811 0.000 5479.908 6154.558
REMAINING_LEASE 5124.646 84.869 0.541 60.383 0.000 4958.253 5291.040
PROX_CBD -16.801 0.323 -0.534 -51.996 0.000 -17.435 -16.168
PROX_ELDERCARE -4.868 1.767 -0.023 -2.755 0.006 -8.332 -1.404
PROX_CHAS 68.923 11.149 0.054 6.182 0.000 47.066 90.781
PROX_HAWKER -17.655 2.145 -0.069 -8.231 0.000 -21.860 -13.449
PROX_MRT -29.076 2.910 -0.079 -9.991 0.000 -34.782 -23.370
PROX_PARK -8.359 2.517 -0.028 -3.322 0.001 -13.293 -3.425
PROX_CHILDCARE 14.435 15.630 0.007 0.924 0.356 -16.209 45.079
PROX_KINDERGARTEN -36.522 7.372 -0.049 -4.954 0.000 -50.975 -22.070
PROX_SUPERMARKET -2.209 7.360 -0.003 -0.300 0.764 -16.639 12.221
PROX_BUS_STOP 15.268 20.027 0.006 0.762 0.446 -23.996 54.531
NUM_KINDERGARTEN 4192.419 1337.767 0.033 3.134 0.002 1569.611 6815.227
NUM_CHILDCARE -3376.015 533.873 -0.059 -6.324 0.000 -4422.719 -2329.311
NUM_BUS_STOP 287.083 401.725 0.006 0.715 0.475 -500.534 1074.699
NUM_CHAS 7300.079 527.443 0.125 13.841 0.000 6265.981 8334.177
----------------------------------------------------------------------------------------------------------
Significant Predictors of Resale Price:
- Floor Area (
FLOOR_AREA_SQM):
With a positive and highly significant coefficient (p < 0.001), floor area strongly predicts resale prices. This aligns with common knowledge that larger HDB flats typically command higher resale values due to the increased living space.
- Level:
The level (floor) of the unit also shows a strong positive effect on resale prices (p < 0.001). Higher floors are often associated with better views, increased privacy, and less noise, which generally adds value.
Remaining Lease:
A positive coefficient for remaining lease (p < 0.001) suggests that resale prices tend to be higher for units with more years left on their lease, as these units offer longer occupancy potential before reaching the lease expiration.
- Proximity to MRT (
PROX_MRT):
While proximity to the MRT has a significant negative coefficient, indicating that being closer to MRT stations tends to increase prices, the distance is inversely coded (i.e., closer proximity lowers distance values). This is consistent with the desirability of accessible public transport, especially in Singapore.
- Proximity to Childcare (
PROX_CHILDCARE) and Kindergartens (PROX_KINDERGARTEN):
Both proximity to childcare and kindergartens are significant (p < 0.01), with negative coefficients, suggesting that families especially those with children value these amenities nearby.
- Other Amenities:
Several other amenities like Proximity to Parks (PROX_PARK) and Supermarkets (PROX_SUPERMARKET) are also significant, though their impact on resale prices is smaller compared to factors like floor area and level.
Insignificant Predictors:
- Proximity to Eldercare (
PROX_ELDERCARE) and Proximity to CHAS Clinics (PROX_CHAS):
These variables have relatively high p-values (p > 0.05), indicating that they may not be significant predictors of resale price in this model. Their lack of significance could suggest that proximity to eldercare and CHAS clinics does not strongly impact the value of HDB resale flats, possibly due to varying demand based on demographics.
- Proximity to Hawker Centres (
PROX_HAWKER):
Although hawker centres are important to residents for affordable food, this variable does not appear to be highly significant in affecting resale prices (p > 0.05), perhaps because it is less exclusive or ubiquitous.
Model Accuracy and Fit:
R-Squared: The model explains about 77.9% of the variability in resale prices, which is a strong indication of model fit and suggests that the included predictors capture the majority of factors influencing resale prices.
RMSE (Root Mean Square Error): The RMSE value of 63051.285 suggests the typical deviation between observed and predicted resale prices. While a lower RMSE would indicate better prediction, this value suggests a reasonable model fit given the complexities of housing prices.
F-statistic: The model’s F-statistic is highly significant (p < 0.001), supporting the overall model validity.
Observations:
Model Coefficients: The coefficients align with common real estate trends, with larger and higher-floor units being more valuable. Proximity to transport and family-oriented amenities also enhances resale value, reflecting a preference for accessible and family-friendly environments in Singapore.
Possible Multicollinearity: Given the number of variables, there may be multicollinearity among location-based predictors (e.g., proximity to MRT, parks, childcare). This can sometimes inflate standard errors and affect the interpretability of each predictor. Further diagnostics, such as Variance Inflation Factor (VIF), could help in evaluating multicollinearity which is done in the next step.
10.2.1 Checking for multicolinearity
In this section, it will introduce you an R package specially programmed for performing OLS regression. It is called olsrr. It provides a collection of very useful methods for building better multiple linear regression models:
- Comprehensive Regression Output
- Variable Selection Procedures
- Heteroskedasticity Tests
- Collinearity Diagnostics
- Model Fit Assessment
- Measures of Influence
- Residual Diagnostics
- Variable Contribution Assessment
In the code chunk below, the ols_vif_tol() of olsrr package is used to test if there are signs of multicollinearity.
ols_vif_tol(resale.mlr) Variables Tolerance VIF
1 FLOOR_AREA_SQM 0.8827152 1.132868
2 LEVEL 0.8133391 1.229499
3 REMAINING_LEASE 0.7187548 1.391295
4 PROX_CBD 0.5466303 1.829390
5 PROX_ELDERCARE 0.8121368 1.231320
6 PROX_CHAS 0.7485047 1.335997
7 PROX_HAWKER 0.8146946 1.227454
8 PROX_MRT 0.9323147 1.072599
9 PROX_PARK 0.7834653 1.276381
10 PROX_CHILDCARE 0.9057032 1.104114
11 PROX_KINDERGARTEN 0.5976016 1.673356
12 PROX_SUPERMARKET 0.7885493 1.268151
13 PROX_BUS_STOP 0.8668546 1.153596
14 NUM_KINDERGARTEN 0.5250241 1.904675
15 NUM_CHILDCARE 0.6711662 1.489944
16 NUM_BUS_STOP 0.7718257 1.295629
17 NUM_CHAS 0.7061126 1.416205
- 0 to 5: variables are not correlated
- 5 to 10: variables are correlated
- Greater than 10: variables are highly correlated
Based on the results of the Variance Inflation Factors (VIF) none of the variables are greater than 5. Each of the independent variables are calculated with another independent variable to attain the values above.
The analysis shows that multicollinearity is not a concern in this model, as all variables have VIF values significantly below the commonly accepted threshold. This suggests that each predictor contributes independently to the model, without redundancy that would impair interpretation or predictive power.
This shows no need to eliminate the variables.
note that there could be binary variables in datasets like Y/N options (dummy variables) which have some signs of correlation which are from the variable of lease properties: LEASEHOLD_99YR vs FREEHOLD etc.
10.2.2 Test for Non-Linearity
In multiple linear regression, it is important for us to test the assumption that linearity and additivity of the relationship between dependent and independent
In the code chunk below, the ols_plot_resid_fit() of olsrr package is used to perform linearity assumption test.
ols_plot_resid_fit(resale.mlr)
While most of the data points are scattered around the 0 line; the plot also indicates slight issues with both linearity and homoscedasticity. The widening funnel shape at higher fitted values suggests that the model may not fully capture the complexity of higher-priced units, potentially indicating a need for transformation (e.g., log-transformation) or a different modelling approach to improve fit.
10.2.3 Test for Normality Assumption
Lastly, the code chunk below uses ols_plot_resid_hist() of olsrr package to perform normality assumption test.
ols_plot_resid_hist(resale.mlr)
The residuals do not follow a perfectly normal distribution, as evidenced by the slightly skewed shape while it has a normal curve. This could impact the reliability of significance tests and confidence intervals, especially for large residual values. This suggests that the model might tend to underpredict for certain observations, leading to larger residuals on the positive side.
10.3 Test for Spatial Autocorrelation
The hedonic model we are constructing incorporates geographically referenced attributes, making it essential to visualize the residuals of this pricing model.
To test for spatial autocorrelation, we need to convert condo_resale.sf from an sf dataframe into a SpatialPointsDataFrame.
First, we will extract the residuals from the hedonic pricing model and save them as a new dataframe.
mlr_res <- as.data.frame(resale.mlr$residuals)
# joining the newly created data frame with resale_train
resale_res <- cbind(resale_train,
mlr_res) %>%
rename(MLR_RES = resale.mlr.residuals)In this step it will convert resale.res from simple feature object into a SpatialPointsDataFrame because spdep package can only process sp conformed spatial data objects.
The code chunk below will be used to perform the data conversion process.
resale_sp <- as_Spatial(resale_res)
resale_spclass : SpatialPointsDataFrame
features : 3824
extent : 11656.47, 42446.44, 28217.48, 48684.45 (xmin, xmax, ymin, ymax)
crs : +proj=tmerc +lat_0=1.36666666666667 +lon_0=103.833333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
variables : 23
names : MONTH, TOWN, ADDRESS, REMAINING_LEASE, FLOOR_AREA_SQM, UNIT_AGE, RESALE_PRICE, LEVEL, PROX_CBD, PROX_ELDERCARE, PROX_CHAS, PROX_HAWKER, PROX_MRT, PROX_PARK, PROX_CHILDCARE, ...
min values : 2023-01, ANG MO KIO, 10 CHAI CHEE RD, 46, 74, 3.66666666666667, 350000, 2, 311.741838971365, 0.000428110190953706, 7.28527615277263e-06, 39.5321340230547, 21.3638644944817, 77.0078791420734, 6.59828462646688e-05, ...
max values : 2023-12, YISHUN, 9A BOON TIONG RD, 95.3333333333333, 176, 53, 1500000, 47, 19126.5211737734, 3261.54359027314, 808.332738794272, 2777.29424402156, 2152.48536476722, 2398.40352003288, 547.386819517238, ...
write_rds(resale_sp, "data/HDB/rds/resale_sp.rds")resale_sp <- read_rds("data/HDB/rds/resale_sp.rds")Following which, tmap package will be used to display the distribution of the residuals on an interactive map.
The code churn below will turn on the interactive mode of tmap.
tmap_mode("view")tm_shape(mpsz) +
tmap_options(check.and.fix = TRUE) +
tm_polygons(col = "grey",
alpha = 0.4) +
tm_shape(resale_res) +
tm_dots(col = "MLR_RES",
alpha = 0.6,
style = "quantile") +
tm_layout(main.title = "Residuals Distribution",
main.title.position = "center",
main.title.size = 0.8) +
tm_view(set.zoom.limits = c(11,14))Switching back to “plot” mode before continuing.
tmap_mode("plot")Based on the figure above it reveals that there are signs of spatial autocorrelation.
To proof that our observation is indeed true, the Moran’s I test will be performed. First, to compute the distance-based weight matrix by using dnearneigh() function of spdep.
nb <- dnearneigh(coordinates(resale_sp), 0, 2000, longlat = FALSE)
summary(nb)Neighbour list object:
Number of regions: 3824
Number of nonzero links: 874036
Percentage nonzero weights: 5.977142
Average number of links: 228.5659
2 disjoint connected subgraphs
Link number distribution:
6 11 12 13 14 17 19 20 23 27 28 29 34 35 36 39 40 41 42 43
1 5 1 3 1 1 1 5 2 1 1 1 2 5 1 2 7 3 8 8
44 45 46 47 48 49 51 52 53 54 55 56 57 58 59 60 61 62 63 64
2 1 1 1 4 13 2 4 2 7 1 3 1 14 8 5 8 2 12 14
65 66 67 68 69 70 71 72 73 74 76 77 78 79 80 81 82 83 84 85
36 13 13 9 3 15 6 8 6 4 6 11 10 6 13 6 4 17 16 13
86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105
14 8 12 14 17 19 12 17 21 10 9 14 4 7 12 10 4 6 3 7
106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125
8 15 13 19 6 7 11 12 6 14 5 13 12 23 20 20 30 14 11 11
126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145
8 14 14 13 7 17 18 16 9 18 17 16 10 14 20 18 26 28 23 21
146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165
15 21 16 21 8 20 14 12 16 10 19 23 5 23 19 22 16 10 8 10
166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185
8 12 6 10 13 10 8 9 7 10 4 5 7 9 4 13 8 3 7 7
187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206
2 8 15 10 11 13 14 12 14 10 6 9 19 11 6 26 10 13 10 10
207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226
12 7 10 12 16 14 12 9 10 3 8 4 4 5 10 7 4 8 6 15
227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246
15 10 17 2 7 5 8 14 6 7 8 11 10 14 12 9 11 24 11 13
247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266
10 7 9 11 11 10 9 4 9 10 12 9 9 18 8 16 16 4 9 21
267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286
14 25 26 15 14 20 28 19 12 13 18 7 12 18 14 9 33 6 5 17
287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306
12 8 16 10 7 7 17 14 11 17 11 14 4 4 2 4 7 4 12 6
307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326
5 5 1 6 7 8 7 9 2 5 4 2 9 2 4 9 14 17 4 9
327 328 329 330 331 332 333 335 336 337 338 339 340 341 342 343 344 345 347 348
2 13 14 3 3 3 4 3 5 5 2 4 2 2 3 1 6 1 2 2
349 350 352 353 354 355 356 357 358 360 361 362 364 365 366 367 368 371 372 373
3 1 3 1 1 3 2 8 3 2 2 2 2 1 3 1 3 8 3 7
374 375 376 377 378 379 380 381 382 383 384 386 387 388 389 390 391 392 394 395
3 7 4 6 12 8 3 3 4 4 4 9 1 6 2 1 4 2 1 8
396 400 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 418 420 421
4 5 3 1 3 1 4 4 4 6 10 5 1 3 1 6 3 3 3 3
423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 439 441 442 443 444
6 2 3 9 3 8 1 4 5 7 3 2 3 1 2 4 3 4 6 4
446 448 449 451 452 455 456 457 459 460 461 462 463 465 466 467 469 470 471 472
3 5 2 3 4 3 3 2 4 6 5 3 3 2 3 2 1 1 5 1
473 474 475 476 477 479 481 482 484 486 487 488 489 491 492 493 494 495 496 497
4 5 1 3 1 3 5 3 1 3 5 5 8 5 1 5 2 5 2 3
498 500 505 506 508 509 510 511 512 513 514 515 516 518 519 520 521 522 523 526
2 6 1 2 5 1 2 4 2 5 3 2 5 2 4 3 1 2 3 3
527 528 529 530 531 534 535 536 538 539 540 541 542 543 544 545 546 548 549 550
3 5 1 1 1 1 4 1 2 1 4 4 3 1 3 6 4 2 1 1
551 553 554 555 556 557 560 561 562 563 564 566 567 569 570 571 572 573 576 579
1 1 3 5 2 3 2 2 6 5 2 7 2 1 3 4 2 2 1 2
581 593 594 595 598 601 604
2 2 1 1 1 2 1
1 least connected region:
1732 with 6 links
1 most connected region:
3263 with 604 links
Following which, nb2listw() of spdep package will be used to convert the output neighbours lists (i.e. nb) into spatial weights.
nb_lw <- nb2listw(nb, style = 'W')
summary(nb_lw)Characteristics of weights list object:
Neighbour list object:
Number of regions: 3824
Number of nonzero links: 874036
Percentage nonzero weights: 5.977142
Average number of links: 228.5659
2 disjoint connected subgraphs
Link number distribution:
6 11 12 13 14 17 19 20 23 27 28 29 34 35 36 39 40 41 42 43
1 5 1 3 1 1 1 5 2 1 1 1 2 5 1 2 7 3 8 8
44 45 46 47 48 49 51 52 53 54 55 56 57 58 59 60 61 62 63 64
2 1 1 1 4 13 2 4 2 7 1 3 1 14 8 5 8 2 12 14
65 66 67 68 69 70 71 72 73 74 76 77 78 79 80 81 82 83 84 85
36 13 13 9 3 15 6 8 6 4 6 11 10 6 13 6 4 17 16 13
86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105
14 8 12 14 17 19 12 17 21 10 9 14 4 7 12 10 4 6 3 7
106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125
8 15 13 19 6 7 11 12 6 14 5 13 12 23 20 20 30 14 11 11
126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145
8 14 14 13 7 17 18 16 9 18 17 16 10 14 20 18 26 28 23 21
146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165
15 21 16 21 8 20 14 12 16 10 19 23 5 23 19 22 16 10 8 10
166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185
8 12 6 10 13 10 8 9 7 10 4 5 7 9 4 13 8 3 7 7
187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206
2 8 15 10 11 13 14 12 14 10 6 9 19 11 6 26 10 13 10 10
207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226
12 7 10 12 16 14 12 9 10 3 8 4 4 5 10 7 4 8 6 15
227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246
15 10 17 2 7 5 8 14 6 7 8 11 10 14 12 9 11 24 11 13
247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266
10 7 9 11 11 10 9 4 9 10 12 9 9 18 8 16 16 4 9 21
267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286
14 25 26 15 14 20 28 19 12 13 18 7 12 18 14 9 33 6 5 17
287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306
12 8 16 10 7 7 17 14 11 17 11 14 4 4 2 4 7 4 12 6
307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326
5 5 1 6 7 8 7 9 2 5 4 2 9 2 4 9 14 17 4 9
327 328 329 330 331 332 333 335 336 337 338 339 340 341 342 343 344 345 347 348
2 13 14 3 3 3 4 3 5 5 2 4 2 2 3 1 6 1 2 2
349 350 352 353 354 355 356 357 358 360 361 362 364 365 366 367 368 371 372 373
3 1 3 1 1 3 2 8 3 2 2 2 2 1 3 1 3 8 3 7
374 375 376 377 378 379 380 381 382 383 384 386 387 388 389 390 391 392 394 395
3 7 4 6 12 8 3 3 4 4 4 9 1 6 2 1 4 2 1 8
396 400 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 418 420 421
4 5 3 1 3 1 4 4 4 6 10 5 1 3 1 6 3 3 3 3
423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 439 441 442 443 444
6 2 3 9 3 8 1 4 5 7 3 2 3 1 2 4 3 4 6 4
446 448 449 451 452 455 456 457 459 460 461 462 463 465 466 467 469 470 471 472
3 5 2 3 4 3 3 2 4 6 5 3 3 2 3 2 1 1 5 1
473 474 475 476 477 479 481 482 484 486 487 488 489 491 492 493 494 495 496 497
4 5 1 3 1 3 5 3 1 3 5 5 8 5 1 5 2 5 2 3
498 500 505 506 508 509 510 511 512 513 514 515 516 518 519 520 521 522 523 526
2 6 1 2 5 1 2 4 2 5 3 2 5 2 4 3 1 2 3 3
527 528 529 530 531 534 535 536 538 539 540 541 542 543 544 545 546 548 549 550
3 5 1 1 1 1 4 1 2 1 4 4 3 1 3 6 4 2 1 1
551 553 554 555 556 557 560 561 562 563 564 566 567 569 570 571 572 573 576 579
1 1 3 5 2 3 2 2 6 5 2 7 2 1 3 4 2 2 1 2
581 593 594 595 598 601 604
2 2 1 1 1 2 1
1 least connected region:
1732 with 6 links
1 most connected region:
3263 with 604 links
Weights style: W
Weights constants summary:
n nn S0 S1 S2
W 3824 14622976 3824 47.67099 15430.41
Lastly, lm.morantest() of spdep package will be used to perform Moran’s I test for residual spatial autocorrelation to validate if our inital observation if there are signs of spatial autocorrelation
moran_test <-lm.morantest(resale.mlr, nb_lw)The Moran’s I test for residual spatial autocorrelation was conducted using the lm.morantest() function. Here’s a breakdown of the results:
Moran I Statistic:
The observed Moran’s I value is 0.299, which indicates a positive spatial autocorrelation in the residuals. This suggests that similar values tend to be spatially clustered, which is a sign of spatial dependence that hasn’t been fully captured by the regression model.
Significance (p-value): The p-value is extremely low (p-value < 2.2e-16) which strongly rejects the null hypothesis of no spatial autocorrelation. This statistically confirms that there is significant spatial autocorrelation in the residuals of the model.
write_rds(moran_test, "data/HDB/rds/moran_resale_sp.rds")moran_test <- read_rds("data/HDB/rds/moran_resale_sp.rds")11 Building Geographical Random Forest Model
To prepare for calibrating the random forest model, we first need to generate coordinate data required by the SpatialML package. This can be accomplished by using st_coordinates() from the sf package. After extracting the coordinates, we will remove the geometry from resale_train using st_drop_geometry(), creating a non-spatial dataset ready for analysis.
coords_train <- st_coordinates(resale_train)
coords_train <- write_rds(coords_train, "data/HDB/rds/coords_train.rds") # writing output into rds for future use
coords_train <- read_rds("data/HDB/rds/coords_train.rds")Using st_drop_geometry()
resale_train_nogeo <- resale_train %>%
st_drop_geometry()11.1 Computing adaptive bandwidth
bw.gwr() of GWmodel package will be used to determine the optimal bandwidth to be used for the Random Forest Model. An adaptive bandwidth will be applied to allow the bandwidth to adjust based on the density of data points, offering greater flexibility in regions with sparse data. Also, to determine the optimal bandwidth, we will use a cross-validation (CV) approach with the bw.gwr() function as well.
bw_adaptive <- bw.gwr(formula = RESALE_PRICE ~ FLOOR_AREA_SQM + LEVEL + REMAINING_LEASE +
PROX_CBD + PROX_ELDERCARE + PROX_CHAS + PROX_HAWKER + PROX_MRT + PROX_PARK +
PROX_CHILDCARE + PROX_KINDERGARTEN + PROX_SUPERMARKET + PROX_BUS_STOP + NUM_KINDERGARTEN +
NUM_CHILDCARE + NUM_BUS_STOP + NUM_CHAS,
data = resale_sp,
approach = "CV",
kernel = "gaussian",
adaptive = TRUE,
longlat = FALSE)Take a cup of tea and have a break, it will take a few minutes.
-----A kind suggestion from GWmodel development group
Adaptive bandwidth: 2371 CV score: 1.368532e+13
Adaptive bandwidth: 1473 CV score: 1.288663e+13
Adaptive bandwidth: 918 CV score: 1.161493e+13
Adaptive bandwidth: 574 CV score: 1.015948e+13
Adaptive bandwidth: 363 CV score: 8.471624e+12
Adaptive bandwidth: 230 CV score: 7.183122e+12
Adaptive bandwidth: 151 CV score: 5.939131e+12
Adaptive bandwidth: 98 CV score: 5.153881e+12
Adaptive bandwidth: 70 CV score: 4.78133e+12
Adaptive bandwidth: 47 CV score: 4.624679e+12
Adaptive bandwidth: 39 CV score: 4.564377e+12
Adaptive bandwidth: 27 CV score: 8.95862e+12
Adaptive bandwidth: 39 CV score: 4.564377e+12
The result shows that 64 neighbour points will be the optimal bandwidth to be used if adaptive bandwidth is used for this data set.
write_rds(bw_adaptive, "data/HDB/model/bw_adaptive.rds")bw_adaptive <- read_rds("data/HDB/model/bw_adaptive.rds")
bw_adaptive[1] 39
11.2 Calibrating Random Forest Model
With the results from the adaptive bandwidth computed earlier on, this section will illustrate the calibration of a Geographically Weighted Random Forest Model using grf() of SpatialML package.
Code chunk below is used to verify the variable names before proceeding.
names(resale_train_nogeo) [1] "MONTH" "TOWN" "ADDRESS"
[4] "REMAINING_LEASE" "FLOOR_AREA_SQM" "UNIT_AGE"
[7] "RESALE_PRICE" "LEVEL" "PROX_CBD"
[10] "PROX_ELDERCARE" "PROX_CHAS" "PROX_HAWKER"
[13] "PROX_MRT" "PROX_PARK" "PROX_CHILDCARE"
[16] "PROX_KINDERGARTEN" "PROX_SUPERMARKET" "PROX_BUS_STOP"
[19] "NUM_KINDERGARTEN" "NUM_CHILDCARE" "NUM_BUS_STOP"
[22] "NUM_CHAS"
names(resale_train) [1] "MONTH" "TOWN" "ADDRESS"
[4] "REMAINING_LEASE" "FLOOR_AREA_SQM" "UNIT_AGE"
[7] "RESALE_PRICE" "LEVEL" "PROX_CBD"
[10] "PROX_ELDERCARE" "PROX_CHAS" "PROX_HAWKER"
[13] "PROX_MRT" "PROX_PARK" "PROX_CHILDCARE"
[16] "PROX_KINDERGARTEN" "PROX_SUPERMARKET" "PROX_BUS_STOP"
[19] "NUM_KINDERGARTEN" "NUM_CHILDCARE" "NUM_BUS_STOP"
[22] "NUM_CHAS" "geometry"
str(resale_train_nogeo)tibble [3,824 × 22] (S3: tbl_df/tbl/data.frame)
$ MONTH : chr [1:3824] "2023-08" "2023-09" "2023-08" "2023-09" ...
$ TOWN : chr [1:3824] "TAMPINES" "JURONG WEST" "PUNGGOL" "KALLANG/WHAMPOA" ...
$ ADDRESS : chr [1:3824] "879A TAMPINES AVE 8" "681C JURONG WEST CTRL 1" "632A PUNGGOL DR" "2C UPP BOON KENG RD" ...
$ REMAINING_LEASE : num [1:3824] 92.9 75.9 83.4 82.2 93.6 ...
$ FLOOR_AREA_SQM : num [1:3824] 93 95 93 90 93 93 93 105 102 92 ...
$ UNIT_AGE : num [1:3824] 6.08 23.08 15.58 16.83 5.42 ...
$ RESALE_PRICE : num [1:3824] 702888 525000 560000 880000 600000 ...
$ LEVEL : num [1:3824] 11 14 14 14 8 14 11 2 8 20 ...
$ PROX_CBD : num [1:3824] 10722 17409 13699 3085 17099 ...
$ PROX_ELDERCARE : num [1:3824] 453 792 1220 885 1859 ...
$ PROX_CHAS : num [1:3824] 74.6 157.3 230.9 104.2 59.3 ...
$ PROX_HAWKER : num [1:3824] 1203 893 1260 211 1016 ...
$ PROX_MRT : num [1:3824] 1254 667 199 197 360 ...
$ PROX_PARK : num [1:3824] 1591 789 1369 755 502 ...
$ PROX_CHILDCARE : num [1:3824] 74.6 49.3 160.6 99.8 55.5 ...
$ PROX_KINDERGARTEN: num [1:3824] 304 285 419 146 327 ...
$ PROX_SUPERMARKET : num [1:3824] 74.6 497.4 241.9 103.7 341.9 ...
$ PROX_BUS_STOP : num [1:3824] 73.1 170 140 28.6 124.7 ...
$ NUM_KINDERGARTEN : num [1:3824] 1 1 0 1 1 1 1 0 1 0 ...
$ NUM_CHILDCARE : num [1:3824] 4 5 5 4 6 11 5 2 7 5 ...
$ NUM_BUS_STOP : num [1:3824] 10 11 7 9 9 7 8 6 11 4 ...
$ NUM_CHAS : num [1:3824] 1 1 4 5 5 2 4 0 3 3 ...
The code chunk below is used to calibrate a model to predict HDB four-room resale price by using random forest function of ranger package.
set.seed(1234)
rf <- ranger(RESALE_PRICE ~ FLOOR_AREA_SQM + LEVEL + REMAINING_LEASE +
PROX_CBD + PROX_ELDERCARE + PROX_CHAS + PROX_HAWKER + PROX_MRT + PROX_PARK +
PROX_CHILDCARE + PROX_KINDERGARTEN + PROX_SUPERMARKET + PROX_BUS_STOP + NUM_KINDERGARTEN +
NUM_CHILDCARE + NUM_BUS_STOP + NUM_CHAS,
data = resale_train_nogeo)
rfRanger result
Call:
ranger(RESALE_PRICE ~ FLOOR_AREA_SQM + LEVEL + REMAINING_LEASE + PROX_CBD + PROX_ELDERCARE + PROX_CHAS + PROX_HAWKER + PROX_MRT + PROX_PARK + PROX_CHILDCARE + PROX_KINDERGARTEN + PROX_SUPERMARKET + PROX_BUS_STOP + NUM_KINDERGARTEN + NUM_CHILDCARE + NUM_BUS_STOP + NUM_CHAS, data = resale_train_nogeo)
Type: Regression
Number of trees: 500
Sample size: 3824
Number of independent variables: 17
Mtry: 4
Target node size: 5
Variable importance mode: none
Splitrule: variance
OOB prediction error (MSE): 1803594630
R squared (OOB): 0.8965837
write_rds(rf, "data/HDB/model/rf.rds")rf <- read_rds("data/HDB/model/rf.rds")
rfRanger result
Call:
ranger(RESALE_PRICE ~ FLOOR_AREA_SQM + LEVEL + REMAINING_LEASE + PROX_CBD + PROX_ELDERCARE + PROX_CHAS + PROX_HAWKER + PROX_MRT + PROX_PARK + PROX_CHILDCARE + PROX_KINDERGARTEN + PROX_SUPERMARKET + PROX_BUS_STOP + NUM_KINDERGARTEN + NUM_CHILDCARE + NUM_BUS_STOP + NUM_CHAS, data = resale_train_nogeo)
Type: Regression
Number of trees: 500
Sample size: 3824
Number of independent variables: 17
Mtry: 4
Target node size: 5
Variable importance mode: none
Splitrule: variance
OOB prediction error (MSE): 1803594630
R squared (OOB): 0.8965837
The model output is also saved in the following code chunk below.
Number or trees (e.g.,
ntree = 50)200 is selected because from earlier experiments with 500 trees, the gain in accuracy was minimal (a slight improvement in R-squared and OOB MSE). Lower number of trees can be a good compromise for faster computation and reducing complexity.Ideally we should let the model run and determine the number of trees which might be up to 500.
kernel = "adaptive"ensures that the model adjusts spatial influence based on local data density
verbose = TRUEis helpful for debugging and monitoring the model’s progress.
pacman::p_load(parallel, profvis)set.seed(1234)
gwRF_adaptive <- grf(formula = RESALE_PRICE ~ FLOOR_AREA_SQM + LEVEL + REMAINING_LEASE +
PROX_CBD + PROX_ELDERCARE + PROX_CHAS + PROX_HAWKER + PROX_MRT + PROX_PARK +
PROX_CHILDCARE + PROX_KINDERGARTEN + PROX_SUPERMARKET + PROX_BUS_STOP + NUM_KINDERGARTEN +
NUM_CHILDCARE + NUM_BUS_STOP + NUM_CHAS,
dframe = resale_train_nogeo,
bw = bw_adaptive, # based on earlier computed adaptive bandwidth
ntree = 200,
kernel = "adaptive",
verbose = TRUE,
coords = coords_train)Ranger result
Call:
ranger(RESALE_PRICE ~ FLOOR_AREA_SQM + LEVEL + REMAINING_LEASE + PROX_CBD + PROX_ELDERCARE + PROX_CHAS + PROX_HAWKER + PROX_MRT + PROX_PARK + PROX_CHILDCARE + PROX_KINDERGARTEN + PROX_SUPERMARKET + PROX_BUS_STOP + NUM_KINDERGARTEN + NUM_CHILDCARE + NUM_BUS_STOP + NUM_CHAS, data = resale_train_nogeo, num.trees = 200, mtry = 5, importance = "impurity", num.threads = NULL, verbose = TRUE)
Type: Regression
Number of trees: 200
Sample size: 3824
Number of independent variables: 17
Mtry: 5
Target node size: 5
Variable importance mode: impurity
Splitrule: variance
OOB prediction error (MSE): 1737145553
R squared (OOB): 0.9003939
FLOOR_AREA_SQM LEVEL REMAINING_LEASE PROX_CBD
3.263165e+12 1.047779e+13 1.488454e+13 2.077840e+13
PROX_ELDERCARE PROX_CHAS PROX_HAWKER PROX_MRT
2.260927e+12 8.291013e+11 2.772082e+12 2.192365e+12
PROX_PARK PROX_CHILDCARE PROX_KINDERGARTEN PROX_SUPERMARKET
1.863344e+12 7.977834e+11 1.113445e+12 1.092576e+12
PROX_BUS_STOP NUM_KINDERGARTEN NUM_CHILDCARE NUM_BUS_STOP
8.380639e+11 3.849591e+11 5.332950e+11 5.872851e+11
NUM_CHAS
1.461604e+12
Min. 1st Qu. Median Mean 3rd Qu. Max.
-402888.9 -21537.6 -488.3 -917.3 21066.1 649147.3
Min. 1st Qu. Median Mean 3rd Qu. Max.
-114283.74 -3516.67 -56.18 -32.67 3518.50 103936.38
Min Max Mean StD
FLOOR_AREA_SQM 0 327369341307 18564730799 36283978234
LEVEL 179517210 237066921698 17357244435 26558883550
REMAINING_LEASE 476456004 610467337521 46169698946 77798113135
PROX_CBD 159160269 398838028133 13601543671 29690681093
PROX_ELDERCARE 229472032 538119933435 12116733479 26711034015
PROX_CHAS 169045122 164752719384 7669699018 12780222753
PROX_HAWKER 205849444 212345427516 11246287515 20680809603
PROX_MRT 197451249 325520355339 11680829832 23418978745
PROX_PARK 150692058 324539620889 12175408296 22684720379
PROX_CHILDCARE 161726549 332142137300 9466247889 21304593306
PROX_KINDERGARTEN 168923960 497657122214 11349619493 30649799599
PROX_SUPERMARKET 173000086 226481570982 9960133501 20427808278
PROX_BUS_STOP 97177838 221479676375 8236757119 20192427806
NUM_KINDERGARTEN 0 107244057163 1587110186 4979233427
NUM_CHILDCARE 15035261 203734234177 5476256383 14279107513
NUM_BUS_STOP 34946054 304540033154 6539865575 19902376737
NUM_CHAS 0 262309235853 5966025174 16453101580
Based on the results and parameters, it can be observed that
Top 5 Predictors by Importance The top six predictors, based on their importance scores, are as follows: 1. FLOOR_AREA_SQM: Interpretation: Floor area is the most important predictor in the model, which aligns with the general understanding that larger flats tend to have higher resale values. Geographical Implication: This factor is likely uniformly significant across areas, as floor area directly impacts the valuation irrespective of location.
LEVEL: Interpretation: The floor level of the unit is also highly significant, likely due to the premium placed on higher floors for better views, privacy, and less noise.REMAINING_LEASE:
Interpretation: The remaining lease length strongly influences resale value, as longer leases provide more value to buyers, especially in Singapore where HDB leases are finite.
PROX_CBD:
Interpretation: Proximity to the Central Business District (CBD) is a major driver of value, as it suggests better access to job opportunities, amenities, and transport options.
Secondary
PROX_HAWKER: Interpretation: Proximity to hawker centres, an important amenity in Singapore, significantly impacts HDB resale prices, as they offer affordable dining options.PROX_MRT:
Interpretation: Proximity to MRT stations is another key factor, as convenient public transportation access is highly valued in Singapore’s urban environment.
PROX_PARKandPROX_ELDERCAREalso contribute, albeit to a lesser degree, highlighting the value of recreational spaces and proximity to eldercare facilities.
The top predictors identified by this geographically weighted random forest model, particularly are floor area, level, and remaining lease, aligning with general expectations in HDB pricing. Location-specific amenities (e.g., proximity to CBD, MRT, and hawker centres) further emphasize the premium placed on convenience and accessibility in Singapore’s real estate market. The high R-squared value and reasonable OOB MSE suggest a robust model that captures both structural and spatial elements effectively.
- Less influential variables:
Variables like NUM_KINDERGARTEN, PROX_CHILDCARE, and NUM_BUS_STOP have lower importance values, suggesting they contribute less to resale price variation.
- R-squared (Not OOB): 98.559% This indicates that the model explains 98.56% of the variance in the resale prices. This seems to be a good fit for the data, suggesting the model is effective at capturing key predictors.
11.2.1 Saving the model output
Code chunk below is used to save the results as an rds file.
write_rds(gwRF_adaptive, "data/HDB/model/gwRF_adaptive.rds")gwRF_adaptive <- read_rds("data/HDB/model/gwRF_adaptive.rds")12 Model Evaluation (Predicting using test data)
To begin, we will prepare the test dataset, consisting of transactions from (Q3) July to September 2024. Like the training data, the test data requires coordinate information for use with the SpatialML package. These coordinates can be extracted using st_coordinates() from the sf package. Afterwards, the geometry data will be removed from resale_test using st_drop_geometry(), resulting in an aspatial dataset ready for analysis.
coords_test <- st_coordinates(resale_test)
coords_test <- write_rds(coords_test, "data/HDB/rds/coords_test.rds")
resale_test_nogeo <- cbind(resale_test, coords_test) %>%
st_drop_geometry()12.1 Predicting with the test data
The Multiple Linear Regression (MLR) will first be used to make predictions based on the test data using the function predict() with the results stored in the test dataset for model evaluation afterwards.
resale_test$MLR_PREDICT <- predict(object = resale.mlr, newdata = resale_test)12.2 Predicting with Geographically Weighted Random Forest
Next, predict.grf() of spatialML package will be used to predict the resale value by using the resale_test data and gwRF_adaptive model calibrated earlier.
resale_test$GWRF_PREDICT <- predict.grf(gwRF_adaptive,
resale_test_nogeo,
x.var.name = "X",
y.var.name = "Y",
local.w = 1,
global.w = 0)Before moving on the output is saved as an rds file for future usage
GRF_pred <- write_rds(resale_test, "data/HDB/rds/resale_test_pred.rds")GRF_pred <- read_rds("data/HDB/rds/resale_test_pred.rds")13 Calculation of Root Mean Square Error (RMSE)
The root mean square error (RMSE) allows us to measure how far predicted values are from observed values in a regression analysis. In the code chunk below, rmse() of Metrics package is used to compute the RMSE.
The output of the predict.grf() is a vector of predicted values. It is converted into a data frame for further visualisation and analysis.
GRF_pred_df <- as.data.frame(GRF_pred)13.1 RMSE for Geographically Weighted Random Forest
rmse_gwrf <- rmse(resale_test$RESALE_PRICE,
resale_test$GWRF_PREDICT)
rmse_gwrf[1] 84847.19
13.2 RMSE for Multiple Linear Regression
rmse_mlr <- rmse(resale_test$RESALE_PRICE,
resale_test$MLR_PREDICT)
rmse_mlr[1] 89310.01
13.3 RMSE Results (Table)
The RMSE results for each model can then be stored as a table format for reference using the head() function.
rmse_mlr <- rmse(resale_test$RESALE_PRICE, resale_test$MLR_PREDICT)
rmse_gwrf <- rmse(resale_test$RESALE_PRICE, resale_test$GWRF_PREDICT)
rmse <- data.frame(
Model = c("Multiple Linear Regression", "Geographically Weighted Random Forest"),
RMSE = c(rmse_mlr, rmse_gwrf)
)
head(rmse) Model RMSE
1 Multiple Linear Regression 89310.01
2 Geographically Weighted Random Forest 84847.19
13.4 Visualising and Analysing the Model Prediction Results
Additionally, a scatter plot can be used to visualise the actual resale price and the predicted resale price of both models by using the code chunk below.
A better predictive model should have the scatter point close to the diagonal line. The scatter plot can be also used to detect if any outliers are in the model.
plot_mlr <- ggplot(data = resale_test,
aes(x = MLR_PREDICT,
y = RESALE_PRICE)) +
geom_point() +
ggtitle("Actual Resale Price vs Predicted Price (MLR)")
plot_gwrf <- ggplot(data = resale_test,
aes(x = GWRF_PREDICT,
y = RESALE_PRICE)) +
geom_point() +
ggtitle("Actual Resale Price vs Predicted Price (GWRF)")
ggarrange(plot_mlr, plot_gwrf, ncol = 2)
RMSE Comparison:
- The RMSE for Multiple Linear Regression (MLR) is 89310.01, while for Geographically Weighted Random Forest (GWRF), it is 84847.19. The lower RMSE for GWRF indicates better predictive performance compared to MLR, suggesting that incorporating spatial information improves prediction accuracy.
Scatter Plot Analysis:
- Both models show scatter points roughly aligned along the diagonal, indicating reasonable predictive capability. The GWRF model has a tighter clustering of points closer to the diagonal, implying better alignment between actual and predicted resale prices.
Model Interpretation:
The GWRF model demonstrates superior performance due to its ability to account for spatial heterogeneity and relationships, as seen in the reduced RMSE and improved scatter alignment.
This aligns with the observed spatial autocorrelation in the dataset, highlighting the value of using a geographically weighted model.
Recommendation:
- For more reliable results for a model, the GWRF should be preferred for tasks requiring precise predictions of resale prices, especially in spatially non-uniform datasets with evident spatial autocorrelation.
14 Conclusion from results
The objective of this exercise was to predict HDB resale prices effectively while accounting for spatial relationships in the data. By comparing the performance of Multiple Linear Regression (MLR) and Geographically Weighted Random Forest (GWRF) models, it was evident that the GWRF model outperformed the MLR model. This was reflected in the lower Root Mean Square Error (RMSE) achieved by the GWRF model, which stood at 84847.19 compared to 89310.01 for the MLR model. The improved accuracy of the GWRF model underscores the importance of addressing spatial heterogeneity when modelling geographically distributed data.
The dataset used in this exercise exhibited signs of spatial autocorrelation, necessitating the use of spatially-aware models. GWRF effectively leveraged spatial relationships to capture local variations in factors influencing resale prices. The use of an adaptive bandwidth further optimized the model by allowing the kernel to adjust dynamically based on data density, thereby enhancing its performance. Additionally, the analysis of variable importance highlighted that factors such as REMAINING_LEASE, proximity to the central business district (CBD), and FLOOR_AREA_SQM were the most influential drivers of resale prices. These insights underscore the importance of focusing on these attributes in future analyses or pricing strategies.
Visual validation through scatter plots of predicted versus actual resale prices demonstrated that the GWRF model produced predictions that aligned more closely with observed values compared to the MLR model. The points in the GWRF scatter plot clustered nearer to the diagonal, further reinforcing its superior predictive capabilities. From a practical perspective, for scenarios involving non-uniform spatial data with evident spatial autocorrelation, the GWRF model proves to be a more robust and reliable choice compared to traditional regression techniques. Its ability to incorporate local variability and spatial relationships makes it particularly well-suited for predictive modelling of geographically distributed data.
In conclusion, this exercise demonstrates the importance of integrating spatial methodologies in real estate pricing. The findings highlight the value of geographically weighted modelling approaches in addressing spatial heterogeneity and improving prediction accuracy. Future analyses should continue to leverage spatially-aware models such as GWRF while refining bandwidth parameters and incorporating additional spatial covariates to further enhance performance. The generalizability of this approach should also be evaluated on other property types to assess its broader applicability.
15 References
Kam, T.S. (2024). Calibrating Hedonic Pricing Model for Private Highrise Property with GWR Method
Kam, T.S. (2024). Geographically Weighted Predictive Models
IS415 Megan Sim Take-Home Exercise 3(2021). Retrieved from https://is415-msty.netlify.app/posts/2021-10-25-take-home-exercise-3/